arXiv:1505.01687vl [math.ST] 7 May 2015 


Bayesian Analysis (2015) 


10, Number 2, pp. 351-377 


Scaling It Up: Stochastic Search Structure 
Learning in Graphical Models 


Hao Wang* 


Abstract. Gaussian concentration graph models and covariance graph models 
are two classes of graphical models that are useful for uncovering latent depen¬ 
dence structures among multivariate variables. In the Bayesian literature, graphs 
are often determined through the use of priors over the space of positive deh- 
nite matrices with fixed zeros, but these methods present daunting computational 
burdens in large problems. Motivated by the superior computational efficiency of 
continuous shrinkage priors for regression analysis, we propose a new framework 
for structure learning that is based on continuous spike and slab priors and uses 
latent variables to identify graphs. We discuss model specihcation, computation, 
and inference for both concentration and covariance graph models. The new ap¬ 
proach produces reliable estimates of graphs and efficiently handles problems with 
hundreds of variables. 

Keywords: Bayesian inference, Bi-directed graph. Block Gibbs, Goncentration 
graph models, Govariance graph models. Credit default swap. Undirected graph, 
Structural learning. 


1 Introduction 

Graphical models use graph structures for modeling and making statistical inferences 
regarding complex relationships among many variables. Two types of commonly used 
graphs are undirected graphs, which represent conditional dependence relationships 
among variables, and bi-directed graphs, which encode marginal dependence among 
variables. Structure learning refers to the problem of estimating unknown graphs from 
the data and is usually carried out by sparsely estimating the covariance matrix of the 
variables by assuming that the data follow a multivariate Gaussian distribution. Under 
the Gaussian assumption, undirected graphs are determined by zeros in the concentra¬ 
tion matrix and their structure learning problems are thus referred to as concentration 
graph models; bi-directed graphs are determined by zeros in the covariance matrix and 
their structure learning problems are thus referred to as covariance graph models. This 
work concerns structure learning in both concentration and covariance graph models. 

Glassical methods for inducing sparsity often rely on penalized likelihood approaches 
(Banerjee et ah, 2008; Yuan and Lin, 2007; Bien and Tibshirani, 2011; Wang, 2014). 
Model fitting then uses deterministic optimization procedures such as coordinate de¬ 
scents. Thresholding is another popular method for the sparse estimation of covariance 
matrices for covariance graph models (Bickel and Levina, 2008; Rothman et ah, 2009); 
however, there is no guarantee that the resulting estimator is always positive definite. 
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Bayesian methods for imposing sparsity require the specification of priors over the space 
of positive definite matrices constrained by fixed zeros. Under such priors, model deter¬ 
mination is then carried out through stochastic search algorithms to explore a discrete 
graphical model space. The inherent probabilistic nature of the Bayesian framework 
permits estimation via decision-theoretical principles, addresses parameter and model 
uncertainty, and provides a global characterization of the parameter space. It also en¬ 
courages the development of modular structures that can be integrated with more com¬ 
plex systems. 

A major challenge in Bayesian graphical models is their computation. Although 
substantial progress in computation for these two graphical models has been made 
in recent years, scalability with dimensions remains a significant issue, hindering the 
ability to adapt these models to the growing demand for higher dimensional problems. 
Recently published statistical papers on these two graphical models either focus on 
small problems or report long computing times. In concentration graph models. Dobra 
et al. (2011) report that it takes approximately 1.5 days to fit a problem of 48 nodes on 
a dual-core 2.8 Ghz computer under C; Wang and Li (2012) report approximately two 
days for a 100 node problem under MATLAB; and Cheng and Lenkoski (2012) report 
a computing time of 1-20 seconds per one-edge update for a 150 node problem using 
a 400 core server with 3.2 GHz CPU under R and C-I--I-. In covariance graph models, 
Silva and Ghahramani (2009) ht problems up to 13 nodes and conclude that “further 
improvements are necessary for larger problems.” 

To scale up with dimension, this paper develops a new approach called stochastic 
search structure learning (SSSL) to efficiently determine covariance and concentration 
graph models. The central idea behind SSSL is the use of continuous shrinkage priors 
characterized by binary latent indicators in order to avoid the normalizing constant 
approximation and to allow block updates of graphs. The use of continuous shrink¬ 
age priors contrasts point-mass priors at zeros that are used essentially by all existing 
methods for Bayesian structure learning in these two graphical models. 

The motivation for the SSSL comes from the successful developments of continuous 
shrinkage priors in several related problems. In regression analysis, continuous shrinkage 
priors were used in the seminal paper by George and McGulloch (1993) in the form of 
a two component normal mixture for selecting important predictors and these priors 
have recently garnered substantial research attention as a computationally attractive 
alternative for regularizing many regression coefficients (e.g.. Park and Casella 2008; 
Griffin and Brown 2010; Armagan et al. 2013). In estimation of covariance matrices, they 
are used for regularizing concentration elements and have been shown to provide fast and 
accurate estimates of covariance matrices (Wang, 2012; Khondker et al., 2013). In factor 
analysis, they are used instead of point-mass priors (Garvalho et al., 2008) for modeling 
factor loading matrices, efficiently handling hundreds of variables (Bhattacharya and 
Dunson, 2011). 

Nevertheless, the current work is fundamentally different from the aforementioned 
works. The research focus here is the structure learning of graphs, which is distinct 
from regression analysis, factor analysis, and the pure covariance estimation problem 
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that solely performs parameter estimation without the structure learning of graphs. Al¬ 
though continuous shrinkage priors generally perform very well in these problems, little 
is known about their performance in problems of structure learning. Because graphs are 
directly determined by covariance matrices, the positive definiteness of any covariance 
matrices poses methodological challenges to investigating prior properties, as well as to 
the construction of efficient stochastic search algorithms. The paper’s main contribu¬ 
tions are the development and exploration of two classes of continuous shrinkage priors 
for learning undirected and bi-directed graphs, as well as two efficient block Gibbs sam¬ 
plers for carrying out the corresponding structure learning tasks that fit problems of 
one or two hundred variables within a few minutes. 


2 Background on graphical models 

Assume that y = (yi,y 2 , ■ • ■ jl/pY is a p-dimensional random vector following a multi¬ 
variate normal distribution N(0, S) with mean of zero and covariance matrix S = (tTij). 
Let fl = (ujij) = be the concentration matrix. Covariance and concentration graph 
models are immediately related to S and O, respectively. Let Y be the n x p data 
matrix consisting of n independent samples of y and let S = Y'Y. The theory and 
existing methods for structure learning are briefly reviewed in the next two sections. 

2.1 Concentration graph models 

Concentration graph models (Dempster, 1972) consider the concentration matrix fl 
and encode conditional dependence using an undirected graph G = (V, A), where V = 
{1,2,... ,p} is a non-empty set of vertices and E C {{i,j) ■ i < j} is a set of edges 
representing unordered pairs of vertices. The graph G can also be indexed by a set of 
p{p — l)/2 binary variables Z = where = 1 or 0 according to whether edge 

(i,j) belongs to E and not. Theoretically, the following properties are equivalent: 

Zij = 0 (bj) ^ ^ Ui -U- Dj I y—{ij) ^ij ~ 0; 

where y-(ij) is the random vector containing all elements in y except for pi and yj, and 
reads as “if and only if”. 

In the Bayesian paradigm, concentration graph models are usually modeled through 
hierarchical priors consisting of the following: (i) the conjugate G-Wishart prior ~ 
Wg(6, D) (Dawid and Lauritzen, 1993; Roverato, 2002) for Jl given the graph Z; and 
(ii) independent Bernoulli priors for each edge-inclusion indicator ztj with inclusion 
probability tt: 

p(n\Z) = /Gw(&,D,Z)~i|S7|^exp{-itr(DfJ)}l^fjeM+(z)}> (1) 

piZ) = (2) 

i<j 

where b is the degrees-of-freedom parameter, D is the location parameter, law ip, D, Z) 
is the normalizing constant, and M~^iZ) is the cone of symmetric positive definite 


354 


Stochastic Search Structure Learning In Graphical Models 


matrices with ofF-diagonal entries Wy = 0 whenever Zij = 0. As for the hyperparam¬ 
eters, common choices are 6 = 3, D = Ip with Ip being the p x p identity matrix and 
TT = 2j{p— 1) (Jones et ah, 2005). Under (l)-(2), some methods (e.g., Jones et al. 
2005; Scott and Carvalho 2008; Lenkoski and Dobra 2011) learn Z directly through its 
posterior distribution over the model space p{Z \ Y) oc p{Y \ Z)p(Z). Other methods 
learn Z indirectly through sampling over the joint space of graphs and concentration 
matrices p{Ll, Z | Y) (Giudici and Green, 1999; Dobra et ah, 2011; Wang and Li, 2012). 
Regardless of the types of algorithms, two shared features of these methods cause the 
framework (l)-(2) to be inefficient for larger p problems. The first of these features 
is that graphs are updated in a one-edge-at-a-time manner, meaning that sweeping 
through all possible edges requires a loop of 0{p^) iterations. The second feature is 
that the normalizing constant lGw{b,L),Z) for non-decomposable graphs requires ap¬ 
proximation. The commonly used Monte Carlo approximation proposed by Atay-Kayis 
and Massam (2005) is not only unstable in some situations but also requires a matrix 
completion step of time complexity 0{Mp‘^) for M Monte Carlo samples, making these 
methods unacceptably slow in large graphs. Recent works by Wang and Li (2012) and 
Cheng and Lenkoski (2012) propose the use of exchange algorithms to avoid the Monte 
Carlo approximation. However, the computational burden remains daunting; empirical 
experiments in these papers suggest it would take several days to complete the fitting 
for problems of p Ri 100 on a desktop computer. 

In the classical formulation, concentration graphs are induced by imposing a graphi¬ 
cal lasso penalty on Lt in order to encourage zeros in the penalized maximum likelihood 
estimates of fl (e.g.. Yuan and Lin 2007; Friedman et al. 2008; Rothman et al. 2008). 
In particular, the standard graphical lasso problem is to maximize the penalized log- 
likelihood 

g 

log(det H) — tr(—H) — p||r2||i, 
n 

over the space of positive definite matrices M’*', with p > 0 as the shrinkage parameter 
and ||r2||i = \^ij\ the Li-norm of LI. The graphical lasso problem has a 

Bayesian interpretation (Wang, 2012). Its estimator is equivalent to the maximum a 
posteriori estimation under the following prior for LI: 

p{n) = C-^ {exp(-A|wy|)}lfjgj^^+, (3) 

l<i.j<p 


where C is the normalizing constant. By exploiting the scale mixture of normal repre¬ 
sentation, Wang (2012) shows that fitting (3) is very efficient using block Gibbs samplers 
for up to the lower hundreds of variables. 

A comparison between the two Bayesian methods (l)-(2) and (3) helps to explain the 
intuition behind the proposed SSSL. Model (l)-(2) explicitly treats a graph Z as an un¬ 
known parameter and considers its posterior distribution, which leads to straightforward 
Bayesian inferences about graphs. However, it is slow to run due to the one-edge-at-a- 
time updating and the normalizing constant approximation. In contrast. Model (3) uses 
continuous priors, enabling a fast block Gibbs sampler that updates LI one column at 
a time and avoids normalizing constant evaluation. However, no graphs are used in the 
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formulation, and thus this approach does not constitute a formal Bayesian treatment of 
structure learning. Still, a better approach might be developed by using the best aspects 
of the two methods. That is, such a method would allow explicit structure learning, as 
in (l)-(2), while maintaining good scalability, as in (3). This possibility is exactly the 
key of SSSL. Similar insights also apply to the covariance graph models described below. 

2.2 Covariance graph models 

Covariance graph models (Cox and Wermuth, 1993) consider the covariance matrix S 
and encode the marginal dependence using a bi-directed graph G = (V,E), where each 
edge has bi-directed arrows instead of the full line used by an undirected graph. Similar 
to concentration graph models, the covariance graph G can also be indexed by binary 
variables Z = {zij)i^j. Theoretically, the following properties are equivalent: 

^ij — 0 (bj) ^ ^ Vi Uj ^ij ~ 

In the Bayesian framework, structure learning again relies on the general hierarchical 
priors p(S, Z) = p(r2 | Z)p(Z). For p(S | Z), Silva and Ghahramani (2009) propose the 
conjugate G-inverse Wishart prior S ^ IWg(&, D) with the density as: 

p(S|Z) = /e,V(6,D,Z)|Sr^exp{-itr(DS-i)}lsgM+(z)^ (4) 

where b specifies the degrees of freedom, D is the location parameter, and loiwib, D, Z) 
is the normalizing constant. Structure learning is then carried out through the marginal 
likelihood function p(Y | Z) = {2Tr)-^P/'^lGiwib+ n,T> + S,Z)/lGiwib,T>,Z). Unfor¬ 
tunately, the key quantity of the normalizing constant lGiw{b, D, Z) is analytically in¬ 
tractable, even for decomposable graphs. Silva and Ghahramani (2009) propose a Monte 
Carlo approximation via an importance sampling algorithm, which becomes computa¬ 
tionally infeasible for p beyond a few dozen. Their empirical experiments are thus lim¬ 
ited to small problems (i.e., p < 20). Later, Khare and Rajaratnam (2011) investigate 
a broad class of priors for decomposable covariance graph models that embed (4) as 
a special case. They also derive closed-form normalizing constants for decomposable 
homogeneous graphs which account for only a tiny portion of the overall graph space. 
Despite these advances, the important question of scalability to higher dimensional 
problems remains almost untouched. 

In the classical framework, the earlier literature focuses on maximum likelihood 
estimates and likelihood ratio test procedures (e.g., Kauermann 1996; Wermuth et al. 
2006; Chaudhuri et al. 2007). Later, two general types of approaches are proposed 
to estimate zeros in the covariance elements. The first is the thresholding procedure, 
which sets Uij to be zero if its sample estimate is below a threshold (Bickel and Levina, 
2008; Rothman et ah, 2009; Cai and Liu, 2011). Another approach is motivated by the 
lasso-type procedures. Bien and Tibshirani (2011) propose a covariance graphical lasso 
procedure for simultaneously estimating covariance matrix and marginal dependence 
structures. Their method is to minimize the following objective function: 

log(det S) -I- tr(—S~^) -f p||S||i, 
n 


( 5 ) 
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over the space of positive definite matrices M~^, with p > 0 as the shrinkage parameter. 
In comparison with thresholding and likelihood ratio testing methods, this approach has 
the advantage of guaranteeing the positive definiteness of the estimated S. Although a 
Bayesian version of (5) has not been explored previously, its derivation is straightforward 
through the prior 


p(S) = C* ^ n {®^P(-^l^bl)}lSeM+’ (6) 


In light of the excellent performance of the Bayesian concentration graphical lasso (3) 
reported in Wang (2012), we hypothesize that (6) shares similar performances. In fact, 
we have developed a block Gibbs sampler for (6) and found that it gives a shrinkage 
estimation of S and is computationally efficient, although it provides no explicit treat¬ 
ment of the graph Z. The detailed algorithm and results are not reported in this paper 
but are available upon request. Comparing (4) and (6) suggests that again, the different 
strengths of (4) and (6) might be combined to provide a better approach for structure 
learning in covariance graph models. 


3 Continuous spike and slab priors for positive definite 
matrices 

Let A = {aij)pxp denote a p-dimensional covariance or concentration matrix; that is, 
A = S or r2. SSSL uses the following new prior for A: 

p(A) = {G(6»)}"^ j(l - 7r)N(aij | 0, Vq)- f 7rN(aij | 0, a?)| Exp(aii | ^)l{AeM+}, (7) 

where N(a | is the density function of a normal random variable with mean 0 

and variance evaluated at a, Exp(a | A) represents the exponential density function 
of the form p(a) = Aexp(—Ax)Ia> 0 i and Ipj is the indicator function. The parameter 
spaces are wq > 0, > 0, A > 0 and tt G (0,1), and the set of all parameters is 

denoted as 0 = {uq, ui, tt. A}. The values of vq and vi are further set to be small and 
large, respectively. The term C{0) represents the normalizing constant that ensures the 
integration of the density function p(A) over the space is one, and it depends on 9. 
The first product symbol multiplies p(j>— l)/2 terms of two-component normal mixture 
densities for the off-diagonal elements, connecting this prior to the classical and Bayesian 
graphical lasso methods through the familiar framework of normal mixture priors for . 
The second product symbol multiplies p terms of exponential densities for the diagonal 
elements. The two-component normal mixture density plays a critical role in structure 
learning, as will be clear below. 

Prior (7) can be defined by introducing binary latent variables Z = £ Z = 

{0, and a hierarchical model: 
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( 8 ) 


i<j i 


p(z I 9) = {C{e)}-^C{Z,vo,vuX)l[{7T^^^{l-7Ty-^'^}, 


(9) 


i<j 


where = vq or vi if Zij = 0 or 1. The intricacy here is the two terms of C(Z, no, fi, A). 
Note that C(Z, no, A) G (0,1) because it is equal to the integration of the product of 
normal densities and exponential densities over a constrained space M+. Thus, (8) and 
(9) are proper distributions. The joint distribution of (A, Z) acts to cancel out the two 
terms of C'(Z, no, ni, A) and results in a marginal distribution of A, as in (7). 

The rationale behind using Z for structure learning is as follows. For an appropriately 
chosen small value of vq , the event Zij = 0 means that comes from the concentrated 
component N(0,nQ), and so is likely to be close to zero and can reasonably be 
estimated as zero. For an appropriately chosen large value of ni, the event Zij = 1 
means that aij comes from the diffuse component N(0,Uj) and so Uij can be estimated 
to be substantially different from zero. Because zeros in A determine missing edges in 
graphs, the latent binary variables Z can be viewed as edge-inclusion indicators. Given 
data Y, the posterior distribution of Z provides information about graphical model 
structures. The remaining questions are then how to specify parameters 0 and how to 
perform posterior computations. 

3.1 Choice of tt 

From (9), the hyperparameter tt controls the prior distribution of the edge-inclusion 
indicators in Z. The choice of tt should thus reflect the prior belief about what the 
graphs will be in the final model. In practice, such prior information is often summarized 
via the marginal prior edge-inclusion probability Pr(zjj = 1). Specifically, a prior for 
Z is chosen such that the implied edge-inclusion probability of edge {i,j) meets the 
prior belief about the chance of the existence of edge (*, j). For example, the common 
choice PT^Zij = 1) = 2/(p — 1) reflects the prior belief that the expected number of 
edges is approximately ( 2 ) = 1) = p. Another important reason that Pr(zij = 1) 

is used for calibrating priors over Z is that the posterior inference about Z is usually 
based upon the marginal posterior probability of PT:{zij = 1 | Y). For example, the 
median probability graph, the graph consisting of those edges whose marginal edge- 
inclusion probability exceeds 0.5, is often used to estimate G (Wang, 2010). Focusing 
on the marginal edge-inclusion probability allows us to understand how the posterior 
truly responds to the data. 

Calibrating tt according to Pr)^^ = 1) requires knowledge of the relation between 
Pr{zij = 1) and tt. From (9), the explicit form of Pr(zij = 1) as a function of tt is 
unavailable because of the intractable term C(Z, uq, ^ 1 , A). A comparison between (9) 
and (2) helps illustrate the issue. Removing C'(Z,uo,ui, A) from (9) turns it into (2) but 
will not cancel out the term C{Z,vq,vi, X) in (8) for the posterior distribution of Z. 
Tedious and unstable nnmerical integration is then necessary to evaluate C(Z, uq, ui. A) 
at each iteration of sampling Z. Inserting C{Z,vo,vi, X) into (9) cancels C{Z,vo,vi, X) 
in (8) in the posterior, thus facilitating computation, yet concerns might be raised 
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about such a “fortunate” cancelation. For example, Murray (2007) notes that a prior 
that cancels out an intractable normalizing constant in the likelihood would depend 
on the number of data points and would also be so extreme that it would dominate 
posterior inferences. These two concerns appear to be not problematic in our case. The 
prior (9) is for the hyperparameter Z, rather than for the parameter directly involved 
in the likelihood; thus it does not depend on sample size. Instead, the prior also only 
introduces mild bias without dominating the inferences, as shown below. 

To investigate whether the cancelation of C(Z, uq, A) causes the prior to be too 
extreme, we compute = 1) numerically from Monte Carlo samples generated 

by the algorithm in Section 4.1. In (8)-(9), we first fix tt = 2/(p— l),i'o = 0.05, and 
A = 1, and then vary the dimension p G {50,100,150,200,250} and vi = hvQ with 
h G {10,50,100}. Panel (a) of Figure 1 displays these estimated Py{zij = 1) as a 
function of p for different h values. As a reference, the curve Pr{zij = 1) = 2l{p — 1) 
is also plotted. The most noticeable pattern is that all three curves representing the 
implied PT{zij = 1) from (9) are below the reference curve, suggesting that there is 
a downward bias introduced by C(Z,vo,vi, X) on Pr(zij = 1). The bias is introduced 
by the fact that the positive definite constraint on A favors a small vq, specified by 
Zij = 0, over a large vi, specihed by Zij = 1. We also see that the bias is larger at larger 
values of h, at which the impact of positive definite constraints is more significant. 
Next, we fix p = 150, h = 50, and A = 1 and vary vq G {0.02,0.05,0.1} and tt G 
{2/(p — 1), 4/(p — 1), 6/(p — 1), 8/(p — 1), 10/(p — 1)}. Panel (b) of Figure 1 displays 
these implied Pr(zij = 1) as a function of tt for different vg values. Again, as a reference, 
the curve Pr(zij = 1) = tt is plotted. The downward bias of Pr(zij = 1) relative 
to TT continues to exist and is larger at larger values of vg or tt because the positive 
definite constraint on A forces Zij = 0 to be chosen more often when vg or tt is large. 


(a) (b) 



Figure 1: The implied prior marginal edge-inclusion probability Pr(zij = 1) from the 
prior (8)-(9) as a function of p at different h (left) and a function of tt at different 
Vg (right), together with two reference curves of Pr(% = 1) = 2/{p — 1) (left) and of 
Pr^Zij = 1) = TT (right). 
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Figure 2: The univariate density of p{aij \ Zij) for different values of h and vg. 


Nevertheless, the downward bias seems to be gentle, as Pr(zij = 1) is never extremely 
small; consequently the prior (8)-(9) is able to let the data reflect the Z if the likelihood 
is strong. 

Another concern is that the lack of analytical relation between Pr(zij = 1) and tt 
might raise challenges against the incorporation of prior information about Pr(zij = 1) 
into TT. This problem can be side-stepped by prior simulation and interpolation. Take a 
p = 150 node problem as an example. If the popular choice Pr(zij = 1) = 2/(p — 1) = 
0.013 is desirable, interpolating the points (7r,Pi(zij = 1)) in Panel (b) of Figure 1 
suggests that tt should be set approximately at 0.018, 0.027, and 0.048 for vg = 0.02, 
0.05, and 0.1, respectively. Our view is that obtaining these points under prior (8)- 
(9) is much faster than evaluating C(Z,vg,vi, A) at each configuration of Z under the 
traditional prior (2). 


3.2 Choice of Vq and Vi 

From (8), the choice of vg should be such that if the data supports Zij = 0 over = 1, 
then Uij is small enough to be replaced by zero. The choice of vi should be such that, 
if the data favor Zy = 1 against z^ = 0, then can be accurately estimated to 
be substantially different from zero. One general strategy for choosing vg and vi, as 
recommended by George and McCulloch (1993), is based on the concept of practical 
significance. Specifically, suppose a small S can be chosen such that \aij\ < S might be 
regarded as practically insignificantly different from zero. Incorporating such a prior 
belief is then achieved by choosing vg and ui, such that the density p{aij \ Zij = 0) is 
larger than the density p{aij \ Zij = 1), precisely within the interval (—(5, 5). An explicit 
expression of vg as a function of S and h can be derived when p{aij \ Zij) are normal. 
However, the implied distribution p{aij \ Zij) from (8)-(9) is neither normal nor even 
analytically tractable. A numerical study will illustrate some aspects of p{aij \ Zij). 

Figure 2 draws the Monte Carlo estimated density of p{aij \ Zij = 0) and p{aij \ 
Zij = 1) for several settings of vg and h. In all cases, there is a clear separation between 
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p{aij I Zij = 0)andp(ay | Zij = 1), with a largerresulting in a sharper separation. This 
property of separation between a small and a large variance component is the essence 
of the prior for structural learning that aims to separate small and large a^s. Clearly, 
the marginal densities are no longer normal. For example, the density of p{aij \ Zij = 0) 
is more spiky than that of N(0, vq); the difference between p{aij \ Zij = 1) when h = 50 
and h = 100 is less clear than the difference between N(0,2500'yo) and N(0, IOOOOuq). 
The lack of an explicit form oip{aij \ Zij) makes the strategies of analytically calculating 
Vq from the threshold 6 infeasible. Numerical methods that estimate p(aij | Zij) from 
Markov chain Monte Carlo (MCMC) samples might be used to choose vq according to 
6 from a range of possible values. 

Another perspective is that, when vq is chosen to be very small and h is chosen 
to be very large, the prior for Qij is close to a point-mass mixture that selects any 
Uij 0 as an edge. Because the point-mass mixture prior provides a noninformative 
method of structure learning when the threshold 6 cannot be meaningfully specified, 
it makes sense to choose vg to be as small as possible, but not so small that it could 
cause MCMC convergence issues, and to choose vi to allow for reasonable values of a^-. 
In our experiments with standardized data, MCMC converges quickly and mixes quite 
well, as long as vg > 0.01 and h < 1000. 

3.3 Choice of A 

The value of A controls the distribution of the diagonal elements an. Because the data 
are usually standardized, a choice of A = 1 assigns substantial probability to the entire 
region of plausible values of an, without overconcentration or overdispersion. From 
our experience, the data generally contain sufficient information about the diagonal 
elements, and the structure learning results are insensitive to a range of A values, such 
as A = 5 and 10. 


4 Fast block Gibbs samplers 

The primary advantage of the SSSL prior (8)-(9) over traditional approaches is its 
scalability to larger p problems. The reduction in computing time comes from two 
improvements. One is that (8)-(9) enable block updates of all p{p— l)/2 edge-inclusion 
indicators in Z simultaneously, while other methods only update one edge-inclusion 
indicator Zy at a time. The other is that there is no need of a Monte Carlo approximation 
of the intractable constants, while all of the other methods require some sort of Monte 
Carlo integration to evaluate any graph. The general sampling scheme for generating 
posterior samples of graphs is to sample from the joint distribution p(A, Z | Y) by 
iteratively generating from p{A \ Z, Y) and p(Z | A, Y). The first conditional p{A \ 
Z, Y) is sampled in a column-wise manner and the second conditional p(Z | A, Y) is 
generated all at once. The details depend on whether A = for concentration graph 
models or A = S for covariance graph models, and they are described below. 
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Consider concentration graph models with A = 17 in the hierarchical prior (8)-(9). 
To sample from p(fl | Z, Y), the following proposition provides necessary conditional 
distributions. The proof is in the Appendix. 

Proposition 1. Suppose A = Q, in the hierarchical prior (8)-(9). Focus on the last 
column and row. Let V = ) be a p x p symmetric matrix with zeros in the diagonal 

entries and (v‘ij)i<j in the upper diagonal entries. Partition 17, S = Y'Y and V as 
follows: 


n = 


^22 


s = 


Sll, Si2 

Si2j 522 


V = 


Vii, Vi2 


' 12 i ' 


( 10 ) 


Consider a change of variables: ((.*.> 12 , W 22 ) —>■ (u = loi 2 ,v = W 22 ~ <^ 12 ^ 11 ‘^ 12 )- 
then have the full conditionals: 


(u I -) ~ N(-Csi 2 , C), and {v \ -) ^ Ga(^ + 1, 


( 11 ) 


where C = {(522 + A)17ii + diag(v ^2 )} 


Permuting any column to be updated to the last one and using (11) will lead to a 
simple block Gibbs step for generating (17 | Z, Y). For p(Z | 17, Y), prior (8)-(9) implies 
all Zij are independent Bernoulli with probability 


Pr(zy 


. I Q Y1 = _ N(a;,j- | 0,u^)7r _ 

N(a;ij I 0,vf)TT + N{uJij \ 0,ug)(l - tt) ' 


( 12 ) 


A closer look at the conditional distributions of the last column u = a;i 2 in ( 11 ) and 
the corresponding edge-inclusion indicator vector 7 = ( 71 ,..., jp-i)' = iz:ip ,..., Zp-i^p)' 
in (12) reveals something interesting. These distributions look like the Gibbs samplers 
used in the stochastic search variable selection (SSVS) algorithm (George and McCul¬ 
loch, 1993). Indeed, consider /3 = (/3i,...,/3p_i)' = —u and note that S 22 = n for 
standardized data. If r7](^^ = 7 S 11 and A = 0, then ( 11 ) implies that 


(/3|zi2,Y)~N 


{Sii -I- diag(v 12 ^)} ^si2,{Sii-bdiag(vi2^)}’ 


and ( 12 ) implies that 


Pr( 7 , = 1 I /3) 


_N(/3j I 0,vI)tt _ 

N(/3j I 0,uf)7r-bN(/3j | 0,ug)(l - tt)’ 


j = l,...,p-l, 


which are exactly the Gibbs sampler of SSVS for the p-th variable. Thus, the problem 
of SSSL for concentration graph models can be viewed as a p-coupled SSVS regression 
problem, as the use of r7(”]^ in the conditional distribution of a;i 2 in place of Sn shares 
information across p regressions in a coherent fashion. This interesting connection has 
not been noted elsewhere, to the best of our knowledge. 
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4.2 Block Gibbs samplers for covariance graph models 

Now, consider covariance graph models with A = S in the hierarchical prior (8)-(9). 
To sample from p(S | Z, Y), the following proposition provides necessary conditional 
distributions; its proof is in the Appendix. 

Proposition 2. Suppose A = H in the hierarchical prior (8)-(9). Focus on the last 
column and row. Let V = ) be a p x p symmetric matrix with zeros in the diagonal 

entries and in the upper diagonal entries. Partition S, S and V as follows: 


I Sii,cri2 \ 

V 0 -' i 2,<^22 ) 


( Sii,Si2 \ 

V si2,S22 } 


V = 


Vii,Vi2 \ 

Vi2,0 )■ 


(13) 


Consider a change of variables: ( 0 - 12 ,( 722 ) —>■ (u = cri 2 ,v = 0-22 — o-'i 2 ^ii^'^i 2 )- We then 
have the full conditionals: 


{u\Y,Z,'En,v) 


N 


{B+ diag(vi2^)} V,{B + diag(vi2^)} 


-1 


(v I Y,Z,Sii,u) -GIG(l-n/ 2 ,A,u'Sri^SiiEiTV- 2 s'i 2 SriV + S 22 ), 


(14) 


where B = Sn^SiiSn^n' + Asri', w = Sii^Si 2 W and GlG(g,a, 6) denotes the 
generalized inverse Gaussian distribution with a probability density function: 

2Kq{V^) 


with Kq as a modified Bessel function of the second kind. 


Surprisingly, Proposition (2) shows that the conditional distribution of any column 
(row) in S is also multivariate normal. This suggests direct column-wise block Gibbs 
updates of S. Sampling from p(Z | S, Y) is similar to that in (12) for p{Z \ fi, Y) with 
only the modification of replacing with aij. 


5 Effectiveness of the new methods 

5.1 Computational efficiency 

The computational speed and the scalability of SSSL block Gibbs samplers are evaluated 
empirically. The data of dimension p G {50,100,150, 200, 250} and sample size n = 2p 
are first generated from N(0,lp) and then standardized. The samplers are implemented 
under the hyperparameters wo = 0.05, = 50,7r = 2/{p — 1) and A = 1. All chains are 

initialized at the sample covariance matrix. All computations are implemented on a six- 
core CPU 3.33GHz desktop using MATLAB. For each run, we measure the time it takes 
the block Gibbs sampler to sweep across all columns (rows) once, which is called one 
iteration. One iteration actually updates each element Oy twice: once when updating 
column i and again when updating column j. This property improves its efficiency. The 
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solid and dashed curves in Figure 3 display the minutes taken for 1000 iterations versus 
p for covariance graph models and concentration graph models respectively. 

Overall, the SSSL algorithms run fast. Covariance graph models take approximately 
2 and 9 minutes to generate 1000 iterations for p = 100 and 200; concentration graph 
models take even less time, approximately 1.2 and 5 minutes. The relatively slower speed 
of covariance graph models is due to a few more matrix inversion steps in updating the 
columns in S. We also measure the mixing of the MCMC output by calculating the 
inefficiency factor 1 + 2 p{k) where p{k) is the sample autocorrelation at lag k. We 
use 5000 samples after 2000 burn-ins and K lags in the estimation of the inefficiency 
factors, where K = argmin;.{p(fc) < 2/^/M,k > 1} with M = 5000 being the total 
number of saved iterations. The median inefficiency factor among all of the elements of 
n was 1 when p = 100, further suggesting the efficiency of SSSL. In our experience, a 
MCMC sample of 5000 iterations after 2000 burnins usually generates reliable results 
in terms of Monte Carlo errors for p = 100 node problems, meaning that the computing 
time is usually less than 10 minutes, far less than the few days of computing time 
required by the existing methods. 



Figure 3: Time in minutes for 1000 iterations of SSSL versus dimension p for covariance 
graph models (solid) and concentration graph models (dashed). 


5.2 Structure learning accuracy 

The preceding section shows tremendous computational gains of SSSL over existing 
methods. This section evaluates these methods on their structure learning performance 
using two synthetic scenarios, both of which are motivated by real-world applications. 

Scenario 1. The first scenario mimics the dependence pattern of daily currency 
exchange rate returns, which has been previously analyzed via concentration graph 
models (Carvalho and West, 2007; Wang et ah, 2011). We use two different synthetic 
datasets for the two types of graphical models. For concentration graph models, Jones 
et al. (2005) generated a simulated dataset of p = 12 and n = 250 that mimics the 
exchange rate return pattern. We use their original dataset downloaded from their 
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website. The true underlying concentration graph has 13 edges and is given by Figure 
2 of Jones et al. (2005). For covariance graph models, we estimate a sparse covariance 
matrix with 13 edges based on the data of Jones et al. (2005) and then use this sparse 
covariance matrix to generate n = 250 samples. The true sparse covariance matrix is as 
follows: 

/ 0.239 0.117 0.031 \ 

0.117 1.554 

0.362 0.002 

0.002 0.199 0.094 

0.094 0.349 - 0.036 

0.295 - 0.229 0.002 

- 0.229 0.715 


0.031 

0.002 

0.164 

0.112 

- 0.028 

- 0.008 



0.112 

0.518 

- 0.193 

- 0.090 



- 0.028 

- 0.193 

0.379 

0.167 



- 0.008 

- 0.090 

0.167 

0.159 


\ - 0.036 0.207 / 

To assess the performance of structure learning, we compute the number of true 
positives (TP) and the number of false positives (FP). We begin by evaluating some 
benchmark models against which to compare SSSL. For concentration graph models, 
we consider the classical adaptive graphical lasso (Fan et ah, 2009) and the Bayesian 
G-Wishart prior Wg(3, Ip) in (1). For covariance graph models, we consider the classical 
adaptive thresholding (Cai and Liu, 2011) and the Bayesian G-inverse Wishart prior 
IWg(3,Ip) in (4). The two classical methods require some tuning parameters, which 
are chosen by 10-fold cross-validation. The adaptive graphical lasso has TP = 8 and 
FP = 7 for concentration graph models and the adaptive thresholding has TP = 13 and 
FP = 45 for covariance graph models. They seem to favor less sparse models, most likely 
because the sample size is large relative to the dimensions. The two Bayesian methods 
are implemented under the priors of graphs (2) with tt = 2/(p — 1). They perform better 
than the classical methods. The G-Wishart has TP = 9 and FP = 1 for concentration 
graph models, and the G-inverse Wishart has TP = 7 and FP = 0 for covariance graph 
models. 

We investigate the performance of SSSL by considering a range of hyperparameter 
values that represent different prior beliefs about graphs: tt G {2/{p— l),4/(p— 1), 0.5}, 
h G {10,50,100}, and vq G {0.02,0.05,0.1}. Under each hyperparameter setting, a 
sample of 10000 iterations is used to estimate the posterior median graph, which is 
then compared against the true underlying graph. Table 1 and 2 summarize TP and FP 
for concentration and covariance graphs respectively. Within each table, patterns can 
be observed by comparing results across different values of one hyperparameter while 
fixing the others. For vq, a larger value lowers both TP and FP because vq is positively 
related to the threshold of practical signihcance that Oy can be treated as zero. For tt, 
a larger value increases both TP and FP, and the case of tt = 0.5 seems to produce 
graphs that are too dense and have high FP, especially for the concentration graph 
models in Table 1. For h, increasing h reduces both TP and FP, partly because h is 
positively related to the threshold that can be practically treated as zero and in part 
because h is negatively related to the implied prior edge-inclusion probability Pv^Zij) 
as discussed in Panel (a) of Figure 1. Next, comparing the results in Tables 1-2 with 
the two Bayesian benchmarks reported above, we can see that SSSL is competitive. For 
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example, Table 2 shows that, except for the extreme cases of tt = 0.5 that favor graphs 
that are too dense or cases of vg = 0.1 that favor graphs that are too sparse, SSSL 
has approximately the same TP = 7 and FP = 0 as the G-inverse Wishart method for 
covariance graph models. 

Table 1: Summary of performance measures under different hyperparameters for a p = 


12 concentration graph example. As for benchmarks, the classical adaptive graphical 
lasso has TP=8 and FP=7; the Bayesian G-Wishart has TP=9 and FP=1. 



TT 

= 2 /( p - 

1 ) 


TT = 4 /(p - 1 ) 



TT ^ 0.5 


Vq 

h= 10 

h = 50 

h = 100 

h = 10 

h = 50 h . — 

100 

h= 10 

h = 50 

h = 100 





Number 

of true positives (TP) 




0.02 

9 

9 

9 

10 

10 

10 

10 

10 

10 

0.05 

10 

9 

9 

10 

10 

9 

10 

10 

10 

0.1 

9 

8 

8 

10 

8 

8 

10 

9 

8 





Number 

of false positives (FP) 




0.02 

3 

2 

0 

7 

3 

1 

14 

4 

3 

0.05 

2 

0 

0 

4 

1 

0 

8 

2 

0 

0.1 

1 

0 

0 

1 

0 

0 

4 

0 

0 


Table 2: Summary of performance measures under different hyperparameters for a 


p = 12 covariance graph example. As for benchmarks, the classical adaptive thresholding 
has TP=13 and FP=45; the Bayesian G-inverse Wishart has TP=7 and FP=0. 



TT 

= 2 /( p - 

1 ) 


TT = 4 /(p - 1 ) 



TT — 0.5 


Vq 

h = 10 

h = 50 

h = 100 

h = 10 

fi — 50 fi — 

100 

h = 10 

h = 50 

h = 100 

0.02 

7 

7 

7 

Number 

8 

of true positives (TP) 

7 7 

9 

7 

7 

0.05 

7 

7 

6 

7 

7 

7 

7 

7 

7 

0.1 

5 

3 

3 

6 

5 

4 

7 

5 

5 

0.02 

0 

0 

0 

Number 

1 

of false positives (FP) 

0 0 

5 

0 

0 

0.05 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0.1 

0 

0 

0 

0 

0 

0 

0 

0 

0 


Scenario 2. The second scenario mimics the dependence pattern of gene expression 
data, for which graphical models are used extensively to understand the underlying 
biological relationships. The real data are the breast cancer data (Jones et ah, 2005; 
Castelo and Roverato, 2006), which contain p = 150 genes related to the estrogen 
receptor pathway. Similar to the first scenario, we generate two synthetic datasets for 
the two graphical models. For concentration graph models, we first estimate a sparse 
concentration matrix with 179 edges based on the real data, and then generate a sample 
of 200 observations from this estimated sparse concentration matrix. For covariance 
graph models, we estimate a sparse covariance matrix with 101 edges based on the 
real data and then use this sparse covariance matrix to generate a synthetic data of 
n = 200 samples. Panels (a) and (b) of Figure 4 display the frequencies of the non-zero 
partial correlations and correlations implied by these two underlying sparse matrices, 
respectively. Among the nonzero elements, 13% of the partial correlations and 60% of 
the correlations are within 0.1. 

We repeat the same procedures of fitting the benchmark and proposed models as 
in Scenario 1. As for benchmarks, the adaptive graphical lasso has TP = 145 and 
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Figure 4: Histograms showing the empirical frequency of the non-zero elements of the 
partial correlation matrix for the first dataset (left) and of the nonzero elements of the 
correlation matrix for the second dataset (right) in Scenario 2 of p = 150. 


FP = 929 for concentration graph models, and the adaptive thresholding has TP = 14 
and FP = 12 for covariance graph models. The G-Wishart prior has TP = 105 and 
FP = 2 and takes about four days to run. The evaluation of the G-inverse Wishart is 
worth elaborating since existing experiments are conducted in smaller p settings and 
little is known about its performance in high-dimensional problems. 

The original model fitting algorithm for the G-inverse Wishart relies on the compu¬ 
tationally expensive importance sampling for approximating the normalizing constant 
and thus is slow and numerically instable for this p=150 dataset. Silva (2013) develops 
a new approximation that requires no Monte Carlo integration, which greatly speeds up 
the computation. The MATLAB routine implementing his algorithm is publicly avail¬ 
able on that paper’s website. We adopt these functions with one modification that sets 
the edge-inclusion probability to be 2/(p—1). The algorithm takes about 2 hours to com¬ 
plete 1000 sweeps, which is substantially slower than SSSL that costs about 5 minutes - 
see Figure 3. Since both SSSL and Silva (2013) require no Monte Carlo approximation, 
the difference in run time is a result of the fact that the SSSL’s block update of Z is 
faster than Silva (2013)’s one-edge-at-a-time update. 

Using the posterior median graph as an estimate of G, the G-inverse Wishart prior 
IWg( 3,I) produces TP = 3 and FP = 3. These two numbers are surprisingly small. An 
exploration of the G-inverse Wishart distribution provides some explanations. The main 
reason is that when G is sparse and p is large, the G-inverse Wishart prior inadvertently 
enforces the free elements in S towards zero and hence exerts a strong prior influence 
on the posterior distribution. To see this, suppose G is empty, then (4) implies that the 
diagonal elements {an} follow independent inverse Gamma distributions 

_ b+2p ^ - - 

p{an\b)(xa^^ ^ expj--;^}, i = l,...,p, 

ii 
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which clearly depend on the dimension p and converge to zero rapidly as p increases 
for a fixed b. Now suppose G is arbitrarily sparse. Although the theoretical marginal 
distributions of the free elements in S are unknown, the distribution of {an} under 
an empty graph leads us to conjecture that the free elements in S could be extremely 
concentrated around zero for large p as well. In our p = 150 example, a simulation from 
IWg(3,Ip) under the ground truth G that contains 101 edges supports this conjecture. 
The estimated prior mean of these 101 off-diagonal free elements {aij} is in the range 
of —6 X 10“® and 6 x 10“®; the estimated prior standard deviation is between 1.9 x 10“^ 
and 2.3 x 10“^. Such a tightly concentrated prior provides little probability support for 
the true graph. The implication on structure learning is that the Bayes factor might 
not truly respond to the data, but largely reflect prior prejudices that aij are extremely 
small. In other words, the overly concentrated prior does not allow the data to speak 
about G and consequently the posterior distribution of G is dominated by its prior. In 
fact, the posterior sample mean and standard deviation of the number of edges in Z 
computed from the MCMC output of Silva (2013) are 170 and 12.9, which are close 
to the prior expected number of edges, computed as ( 2 ) x = 150 and its standard 

deviation, computed as { 2 ) x x (1 — = 12.16. 

The fundamental cause of this strong prior influence is perhaps that the parameter 
space {6 : 6 > 0} assumed by Silva and Ghahramani (2009) is too restrictive. It might be 
reasonable to let the parameter space depend on G. For example, the standard inverse- 
Wishart theory implies that the parameter space should be {6 : 6 > 2 — 2p} when G 
is empty and so b could be even negative, and {6 : 6 > 0} when G is full. Hierarchical 
models that allow the value of b to be G-dependent might be helpful. A thorough 
examination along these lines is beyond the scope of the current paper. However, it is 
probably safe to conclude that further investigation should be called upon to follow the 
innovative framework of Silva and Ghahramani (2009). 

As for SSSL, Tables 3 and 4 summarize its performance. When the results are com¬ 
pared across different levels of one hyperparameter, the general relations between TP 
or FP and a hyperparameter are similar to those in Scenario 1. In fact, the patterns 
appear to be more significant in Scenario 2 because priors have greater influences in 
this relatively small sample size problem. When compared with benchmarks, SSSL is 
competitive. For concentration graph models. Table 3 suggests that, except for a few 
extreme priors that favor overly dense graphs, SSSL produces much sparser graphs than 
the classical adaptive graphical lasso, for which FP = 929 is too high. When vq = 0.02, 
SSSL is also comparable to the Bayesian G-Wishart prior. When uq increases, TP drops 
quickly because many signals are weak (Figure 4) and are thus treated as practically 
insignificant by SSSL. For covariance graph models. Table 4 suggests that SSSL gener¬ 
ally performs better than the adaptive thresholding. The only exceptions are cases in 
which the hyperparameters favor overly dense graphs (e.g., tt = 0.5) or overly sparse 
graphs (e.g., uq = 0 . 1 ). 

In summary, under sensible choices of hyperparameters, such as vq = 0.02, h = 50, 
and TT = 2/(p — 1), SSSL is comparable to existing Bayesian methods in terms of 
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Table 3; Summary of performance measures under different hyperparameters for a 


p = 150 concentration graph example. As for benchmarks, the classical adaptive lasso 
has TP=145 and FP=929; the Bayesian G-Wishart has TP=105 and FP=2. 
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78 
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89 

83 
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59 
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58 
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Number 
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0.02 

3 

0 

0 

9 

1 

0 
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0.05 

0 

0 
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0 

0 

0 
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39 

9 

0.1 
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0 

0 

0 

0 

0 
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4 
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Table 4: Summary of performance measures under different hyperparameters for a p = 
150 covariance graph example. As for benchmarks, the classical adaptive thresholding 


has TP=14 and FP=12; the Bayesian G-inverse Wishart cannot run. 
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structure learning accuracy. However, SSSL’s computational advantage of sheer speed 
and simplicity makes it very attractive for routine uses. 


6 Graphs for credit default swap data 

This section illustrates the practical utility of graphical models by applying them to 
credit default swap (CDS) data. CDS is a credit protection contract in which the buyer 
of the protection periodically pays a small amount of money, known as “spread”, to the 
seller of protection in exchange for the seller’s payoff to the buyer if the reference entity 
defaults on its obligation. The spread depends on the creditworthiness of the reference 
entity and thus can be used to monitor how the market views the credit risk of the 
reference entity. The aim of this statistical analysis is to understand the cross-sectional 
dependence structure of CDS series and thus the joint credit risks of the entities. The 
interconnectedness of credit risks is important, as it is an example of the systemic risk 
- the risk that many financial institutions fail together. 

The data are provided by Markit via Wharton Research Data Services (WRDS) 
and comprise daily closing CDS spread quotes for 79 widely traded North American 
reference entities from January 2, 2001 through April 23, 2012. The quotes are for five- 
year maturity, as five-year maturity is the most widely traded and studied term. The 
spread quote is then transformed into log returns for graphical model analysis. 
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To assess the variation in the graphs over time, we estimate graphs using a one- 
year moving window. In particular, at the end of each month t, we use the daily CDS 
returns over the period of month t — 11 to month t to estimate the graph Gt- The 
choice of a one-year window is intended to balance the number of observations, as well 
as to accommodate the time-varying nature of the graphs. The first estimation period 
begins in January 2001 and continues through December 2001, and the last is from 
May 2011 to April 2012. In total, there are 89 estimation periods, corresponding to 
89 time-varying graphs for each type of graph. We set the prior hyperparameters at 
Vo = 0.02, h = 50, TT = 2/(p— 1) and A = 1. The MCMC are run for 10000 iterations 
and the first 3000 iterations are discarded. The graphs are estimated by the posterior 
median graph. 



Figure 5: Time series plots of the estimated numbers of edges for the covariance graph 
(solid line), the concentration graph (dashed), and the common graph (dash dotted). 


Figure 5 shows changes over time of the estimated number of edges for the two types 
of graphs and the number of common edges. The numbers of edges are in the range of 40 
and 160 out of a total of 3081 possible edges, indicating a very high level of sparsity. At 
each time point, both types of graph reflect about the same level of sparsity. Over time, 
they show similar patterns of temporal variations. There is a steady upward trend in the 
number of edges starting from mid 2007 for both types of graph. The timing of the rise 
of the number of edges is suggestive. Mid-2007 saw the start of the subprime-mortgage 
crisis, which was signified by a number of credit events, including bankruptcy and the 
significant loss of several banks and mortgage firms, such as Bear Stearns, GM finance, 
UBS, HSBC and Countrywide. If the number of edges can be viewed as the “degree of 
connectedness” among CDS series, then this observed increase implies that the market 
tends to be more integrated during periods of credit crisis and consequently tends to 
have a higher systemic risk. 

To further illustrate the interpretability of graphs. Figure 6 provides four snapshots 
on graph details at two time points, in December 2004 and in December 2008. The 
covariance graph for December 2004 (Panel a) has 44 edges and 30 completely isolated 
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(a): Covariance Graph, December 2004 


(b) Covariance Graph, December 2008 





panels are: Panel (a), covariance graph in December 2004; Panel (b), covariance graph 
in December 2008; Panel (c), concentration graph in December 2004; and Panel (d) 
concentration graph in December 2008. 


nodes, as opposed to the covariance graph for December 2008 (Panel b), which has 128 
edges and no isolated nodes. These differences suggest that the connectedness of the 
network rises as the credit risks of many reference entities become linked with each other. 
The same message is further confirmed by concentration graphs. The concentration 
graph for December 2004 (Panel c) has 55 edges and 20 completely isolated nodes, 
while the concentration graph for December 2008 (Panel d) has 135 edges and no isolated 
nodes. The increase of connectedness is also manifested by the fact that every pair of 
nodes in the concentration graph on December 2008 is connected by a path. 

Finally, we zoom into subgraphs involving the American International Group (AIG) 
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to study whether the graphs make economic sense at the firm level. AIG provides a good 
example for study because it suffered a significant credit deterioration during the 2007- 
2008 crisis when its credit ratings were downgraded below “AA” levels in September 
2008. Panels (a) and (b) of Figure 7 show the covariance subgraph in December 2004 
and in December 2008. The subgraphs show that AIG experienced some interesting 
dependence structure shifts between these two periods. In December 2004, AIG formed 
a clique with three other major insurance companies: the Metropolitan Life Insurance 
Gompany (MET), the Chubb Corporation (CB) and the Hartford Financial Services 
Group (HIG). The credit risks of these four insurance companies are all linked to each 
other. By December 2008, the linkages between AIG and the other three insurance com¬ 
panies disappear, while the linkages among the other three firms remain. On the other 
hand, AIG is now connected with two non-insurance financial companies, the GE Cap¬ 
ital (GE) and the American Express Company (AXP). The concentration subgraphs of 
AIG displayed in Panels (c) and (d) of Figure 7 show similar structural shift patterns as 
in the covariance graphs, although here dependence functions differently as conditional 
dependence. AIG initially formed a prime component with the other three insurance 
firms in December 2004. The connections between AIG and the other insurance firms 
were severed in December 2008, and new connections between AIG and GE and between 
AIG and AXP arose. 

Given that AIG suffered a more severe credit crisis than other insurance companies, 
the uncovered network shift might indicate that the CDS market was able to adjust its 
view regarding the credit risk of AIG and treat it as unrelated to its peer insurance 
firms during the credit crisis. Such timely changes in dependence structures may shed 
light on the question of the information efficiency of the CDS market, from a point of 
view that is different from the usual analysis of changes in levels. 


(a) CGM 2004 



(b) CGM 2008 


(c) GGM 2004 (d) GGM 2008 



Figure 7: Subgraphs involving AIG estimated for two different time periods. There are 
four snapshots: Panel (a), covariance graph in December 2004; Panel (b), covariance 
graph in December 2008; Panel (c), concentration graph in December 2004; and Panel 
(d) concentration graph in December 2008. The six subgraph nodes are: American 
International Group (AIG); GE Capital (GE); Metropolitan Life Insurance Company 
(MET); The Chubb Corporation (CB); Hartford Financial Services Group (HIG); and 
American Express Company (AXP). 










372 


Stochastic Search Structure Learning In Graphical Models 


7 Conclusion 

This paper proposes a new framework for the structure learning of concentration graph 
models and covariance graph models. The main goal is to overcome the scalability 
limitation of existing methods without sacrificing structure learning accuracy. The key 
idea is to use absolutely continuous spike and slab priors instead of the popular point- 
mass mixture priors to enable accurate, fast, and simple structure learning. Our analysis 
suggests that the accuracy of these new methods is comparable to that of existing 
Bayesian methods, but the model-fitting is vastly faster to run and simpler to implement. 
Problems with 100-250 nodes can now be fitted in a few minutes, as opposed to a few 
days or even numeric infeasibility under existing methods. This remarkable efficiency 
will facilitate the application of Bayesian graphical models to large problems and will 
provide scalable and effective modular structures for more complicated models. 

The focus of the paper is on the structure learning of graphs. A related yet different 
question is the parameter estimation of the covariance matrix. Since our priors place zero 
probability mass on any sparse matrix containing exact zeros, as opposed to the point- 
mass mixture priors, one concern is then about where and how the posterior of O or S 
will concentrate when the true covariance/concentration matrix is sparse. Our limited 
experiments suggest that the posterior distribution from the proposed two-component 
normals is indeed more dispersed around zero than those from the point-mass mixture 
priors. Take the concentration graph in Scenario 1 as an example. Under SSSL, the 
average magnitude of the posterior mean estimates of these {w^} corresponding to the 
true zeros in H is in the range of 0.0085 and 0.045, depending on the hyperparameters 
of {vQjhjir). In contrast, under the G-Wishart prior, the estimates of these {w^-} have 
a smaller average magnitude of 0.0040. The small-variance normal shrinks parameters 
less aggressively than the point-mass mixture. Although we do not find it problematic 
for structure learning, such weaker shrinkage may cause the SSSL’s performance of pa¬ 
rameter estimation to be suboptimal. A refinement is to replace the normal distribution 
with distributions having higher densities near zero. The Bayesian shrinkage regres¬ 
sion literature shows that some heavy-tailed distributions offer comparable parameter 
estimation performance to the point-mass mixture priors (e.g., Armagan et al. 2013; 
Griffin and Brown 2010; Carvalho et al. 2010). Computational tractability of SSSL is 
maintained by applying the data-augmentation to the mixture normal representation 
of these alternative distributions. 


Appendix 

Proof of Proposition 1 

Clearly, the conditional distribution of fl given the edge-inclusion indicators Z is 
p{n I Y,Z) oc |J7|texp{-tr(isf2)}]^|exp(-^^)|]^|exp(-^WiO|- (15) 
Under the partitions (10), the conditional distribution of the last column in n is 
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p(u>12,^22 I Y, Z, Oil) C)C (uJ22 — <^12^11^‘^12) ^ 

exp [ - i{a;'i2D^^a;i2 + 2si2u;i2 + (S22 + A)w22}], 

where D = diag(vi2). Consider a change of variables (u, ^22) —>■ (u = a;i2,i' = W22 — 
whose Jacobian is a constant not involving (a;i2,w). So 


p{u, w I Y, Z, Oil) oc 2 exp(— ^ ^ v) exp 


— — [u'{D ^ + (s22 + A) 02 i^}u+ 2 s)^ 2 u] 


This implies that: 

{u,v) I (Oii,Z,Y)^N(-Csi 2 ,C)Ga(| + 1 ,^HH^), 

where C = {(522 + A)Ori^ + 


Proof of Proposition 2 

Given edge-inclusion indicators Z, the conditional posterior distribution of S is 
p(S I Y,Z) oc |S|-t exp{-itr(SS-i)}]^ jexp(-^f-)| jexp(-^cri,)|. (16) 


Under partitions (13), consider a transformation {(Ti 2 ,a 22 ) —?► (u = <ti 2 ,u = 1722 — 
^ 12) ^ whose Jacobian is a constant not involving (u, u) and apply the block 
matrix inversion to S using blocks (Sii,u, u): 

v-i - ( + sri^u'sri^-i -sri^u-i 

y — v~^ 



After removing some constants not involving (u, u), the terms in (16) can be expressed 
as a function of (u,i;): 


|S| 



tr(sx;"^) 

n{ 

»p(-;„!.)} 



n 

i 

|exp(-^cr,o| 


oc 

oc 

oc 


oc 


V, 

u'SCj^SiiSC;^^u-!;“^ - 2si2^n^uu“^ -|- 522 ?^”^, 
exp(-^u'D"^u), 

exp(-^A(u'S)'i^u-|-u)), 


where D = diag(vi 2 ). Holding all but (u, v) fixed, we can then rewrite the logarithm of 
(16) as 
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logp(u,?; I-) =--^<1 nlog(?;) + ^ - 2 s'^ 2 ^ii^u^ ^ + S 22 V ^ 


-u'D + Au'Sii^u + A?; 


constant. 


This gives the conditionals of u and v as 

(u I Y, Z, ^n,v) ^ N{(B + D"!)- V, (B + D-^)-!}, 

(v I Y,Z,Sii,u) -GIG(l-n/2,A,u'SniSiiEnV-2s'i2SnV + S 22 ), 


where B = ^ w = ^ and GIG(g,a, 6) denotes the 

generalized inverse Gaussian distribution with probability density function; 


p(x) 


(a/6)'?/^ (q-l) -(a^+fc/x)/2 

2K,(V^) 


with Kg a modified Bessel function of the second kind. 


Supplementary materials 

The MATLAB routines implementing all frequentist and Bayesian procedures used in 
the paper are available from the author’s web site of the paper at: https://www.msu. 
\edu/~haowang/RESEARCH/SSSL/sssl.html. 
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